---
title: "MPG Regression Analysis"
author: "Jesse Devitt"
output:
flexdashboard::flex_dashboard:
theme:
version: 4
bootswatch: cosmo
primary: "blue"
orientation: columns
vertical_layout: fill
source_code: embed
---
<style>
.chart-title { /* chart_title */
font-size: 20px;
}
body{ /* Normal */
font-size: 18px;
}
</style>
```{r setup, include=FALSE}
library(flexdashboard)
library(shiny)
library(shinydashboard)
```
Introduction
===
<head>
<base target = "_blank">
</head>
<font size=5>
**Vehicle Attributes and Effects on MPG**
</font>
Column {data-width=650}
-----------------------------------------------------------------------
### Motivation
The purpose of this study was to create a linear regression model based on vehicle data from the 1985 Ward's Automotive Yearbook, in hopes of uncovering possible relationships between MPG (miles per gallon) and other vehicle attributes.
By separating city and highway MPG, I hoped to uncover differences in how the two rates are affected by the vehicle attributes given in the data.
The dataset in this study contains 205 observations, 6 of which were removed due to missing values, leaving a total of 199 observations.
Additionally, the dataset originally contained 26 variables, but I chose to study only 15 of them, subsetting the data based on the amount of missing values in each variable and its relevance to the research question.
```{r}
knitr::opts_chunk$set(echo = FALSE)  # hide code in the dashboard; full source is available via source_code: embed
library(pacman)
library(tidyverse)
library(plotly)
library(corrplot)
library(RColorBrewer)
library(stats)
vehicles <- read.csv("~/Library/Mobile Documents/com~apple~CloudDocs/MTH 369/automobile/imports-85.data", header=FALSE)
vehicles <- vehicles %>% dplyr::select(-c(V1, V2, V3, V6, V7, V8, V9, V15, V16, V18, V26))
names(vehicles) <- c("fueltype", "aspiration", "wheelbase", "length", "width", "height", "curbweight", "enginesize", "bore", "stroke", "compressionratio", "horsepower", "peakrpm", "city", "highway")
# "?" marks missing values in the raw data, so replace it with NA before coercing to numeric
vehicles[vehicles == "?"] <- NA
numeric_cols <- c("curbweight", "enginesize", "bore", "stroke", "compressionratio", "horsepower", "peakrpm", "city", "highway")
vehicles[numeric_cols] <- lapply(vehicles[numeric_cols], as.numeric)
vehicles$fueltype <- as.factor(vehicles$fueltype)
vehicles$aspiration <- as.factor(vehicles$aspiration)
vehicles <- vehicles[complete.cases(vehicles), ]
vehicles <- vehicles[, c("city", names(vehicles)[-which(names(vehicles) == "city")])]
vehicles <- vehicles[, c("highway", names(vehicles)[-which(names(vehicles) == "highway")])]
standardized <- apply(vehicles[, 5:15], 2, function(x) (x-mean(x)) / sd(x))
v <- vehicles %>% dplyr::select(highway, city, fueltype, aspiration)
stan_vehicles <- cbind.data.frame(v, standardized)
knitr::kable(vehicles[1:10,])
```
Column {data-width=350}
-----------------------------------------------------------------------
### Variable Index
The following explanatory variables were the focus of our analysis:
- Fuel Type: gas or diesel
- Aspiration: standard (std) or turbo
- Wheel Base: the horizontal distance (in.) between the centers of the front and rear wheels
- Length: length (in.) of vehicle
- Width: width (in.) of vehicle
- Height: height (in.) of vehicle
- Curb-weight: the published weight (lbs.) of a vehicle with a full tank of fuel and all fluids filled
- Engine-size: the engine's displacement (cubic in.), i.e., the volume of fuel and air that can be pushed through a car's cylinders
- Bore: diameter (in.) of an engine cylinder
- Stroke: the distance (in.) the piston travels within the cylinder
- Horsepower: the power an engine produces (1 hp = 550 ft-lbs per second)
- Compression-ratio: the ratio of a cylinder's maximum volume to its minimum (compressed) volume
- Peak-RPM: the maximum speed at which an engine can spin (rotations per minute)
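For reference (this is a standard engine-design definition, not something computed from this dataset), the compression ratio can be written in terms of the cylinder's swept (displaced) and clearance volumes:

$$\text{compression ratio} = \frac{V_{\text{swept}} + V_{\text{clearance}}}{V_{\text{clearance}}}$$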
Response Variable EDA
===
Column {.tabset data-width=650}
---
### Highway
```{r}
ggplot(vehicles, aes(x = highway)) + geom_histogram(color = "white", fill = "darkred") + labs(x = "Highway MPG", y = "Count of Vehicles", title = "Distribution of Highway MPG") + theme_classic()
```
### City
```{r}
ggplot(vehicles, aes(x = city)) + geom_histogram(color = "white", fill = "blue") + labs(x = "City MPG", y = "Count of Vehicles", title = "Distribution of City MPG") + theme_classic()
```
Column {data-width=350}
---
### Explanation
From these two histograms, we see that both highway and city mpg have relatively symmetric distributions. Although there appears to be a slight skew to the right, it is not pronounced enough to rule out approximate normality. In terms of shape, the two histograms are similar, each with three peaks near the center of the distribution.
Given these results, we see nothing that would prevent this data from being viable for linear regression.
Correlation Exploration
===
Column {.tabset data-width=650}
---
### Highway
```{r, echo=FALSE}
highwaynumeric <- vehicles %>% select(-c(fueltype, aspiration, city))
m1 <- round(cor(highwaynumeric), 2)
corrplot(m1, method = c("number"),type="upper",main="Highway MPG",mar=c(0,0,1,0), number.cex = 0.5)
```
### City
```{r, echo=FALSE}
citynumeric <- vehicles %>% select(-c(fueltype, aspiration, highway))
m <- round(cor(citynumeric), 2)
corrplot(m, method = c("number"),type="upper",main="City MPG",mar=c(0,0,1,0), number.cex = 0.5)
```
Column {data-width=350}
---
### Explanation of Collinearity
From these two correlation plots, we see that wheelbase, length, width, curb-weight, engine size, bore, and horsepower all have strong negative correlations with both highway and city mpg. However, many of these explanatory variables also have strong positive correlations with one another, which could signify the presence of collinearity.
Thus, we should move forward with LASSO (Least Absolute Shrinkage and Selection Operator) model selection, a commonly used remedy for regression models whose predictors exhibit collinearity.
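Concretely, LASSO adds an L1 penalty to the least-squares objective, which shrinks coefficients and sets some of them exactly to zero (the penalty weight $\lambda$ is tuned by cross-validation on the Model Selection page):

$$\hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^{2} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}$$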
Model Selection
===
Column {data-width=300}
---
### Lambda Estimate for Highway Model
```{r, fig.align='center', echo=FALSE}
x <- data.matrix(stan_vehicles[, 3:15])  # data.matrix codes the factor columns (fueltype, aspiration) numerically; glmnet requires a numeric matrix
y1<-log(vehicles$highway)
set.seed(2000)
proportion_split<-0.7
train<-sample(1:nrow(x), round(nrow(x)*proportion_split))
y1.train<-y1[train]
y1.test<-y1[-train]
x.train<-x[train,]
x.test<-x[-train,]
library(glmnet)
set.seed(2000)
cv.lasso1<-cv.glmnet(x.train, y1.train, alpha = 1)
#cv.lasso1$lambda.min
plot(cv.lasso1)
```
### Reduced Highway MPG Model
```{r, fig.align='center', echo=FALSE}
model1<-glmnet(x.train, y1.train, alpha = 1, lambda = cv.lasso1$lambda.min)
coef1<-coef(model1)
#to compute training SSE from LASSO regression
y_predictedtrain1 <- predict(model1, s = cv.lasso1$lambda.min, newx = x.train)
SSEtrain1<-sum((y_predictedtrain1-y1.train)^2)
residuals1 <- y_predictedtrain1 - y1.train
#Computing R-squared
SSTOtrain1<-sum((y1.train-mean(y1.train))^2)
R2train1<-1-SSEtrain1/SSTOtrain1
#to compute testing SSE from LASSO regression
y_predictedtest1 <- predict(model1, s = cv.lasso1$lambda.min, newx = x.test)
SSEtest1<-sum((y_predictedtest1-y1.test)^2)
#Computing R-squared
SSTOtest1<-sum((y1.test-mean(y1.test))^2)
R2test1<-1-SSEtest1/SSTOtest1
print(coef1)
```
Column {data-width=300}
---
### Estimated Lambda for City Model
```{r, fig.align='center', echo=FALSE}
library(MASS)
#bc<-boxcox(city~peakrpm+horsepower+compressionratio+curbweight+length, data = vehicles)
#lambda<-bc$x[which.max(bc$y)]
x <- data.matrix(stan_vehicles[, 3:15])  # numeric matrix (factors coded numerically), as glmnet requires
y<-log(vehicles$city)
set.seed(2000)
proportion_split<-0.7
train<-sample(1:nrow(x), round(nrow(x)*proportion_split))
y.train<-y[train]
y.test<-y[-train]
x.train<-x[train,]
x.test<-x[-train,]
library(glmnet)
set.seed(2000)
cv.lasso<-cv.glmnet(x.train, y.train, alpha = 1)
#cv.lasso$lambda.min
plot(cv.lasso)
```
### Reduced City MPG Model
```{r, fig.align='center', echo=FALSE}
model<-glmnet(x.train, y.train, alpha = 1, lambda = cv.lasso$lambda.min)
coef<-coef(model)
#to compute training SSE from LASSO regression
y_predictedtrain <- predict(model, s = cv.lasso$lambda.min, newx = x.train)
SSEtrain<-sum((y_predictedtrain-y.train)^2)
residuals<-y_predictedtrain - y.train # fitted values are y_predicted
#Computing R-squared
SSTOtrain<-sum((y.train-mean(y.train))^2)
R2train<-1-SSEtrain/SSTOtrain
#to compute testing SSE from LASSO regression
y_predictedtest <- predict(model, s = cv.lasso$lambda.min, newx = x.test)
SSEtest<-sum((y_predictedtest-y.test)^2)
#Computing R-squared
SSTOtest<-sum((y.test-mean(y.test))^2)
R2test<-1-SSEtest/SSTOtest
print(coef)
```
Column {data-width=400}
---
### Explanation
After trying several transformations of the response variable, I found that a logarithmic transformation worked best for both highway and city mpg. This, along with LASSO model selection, led to relatively high R^2 values for both of my models.
In the LASSO selection for both models, the lambda value was chosen to minimize the mean cross-validated error (MSE) estimated via cross-validation.
Highway MPG -
- Before the logarithmic transformation, the R^2 was 0.8583 for the training data and 0.7683 for the testing data.
- After the logarithmic transformation, the R^2 increased to 0.8774 for the training data and 0.8460 for the testing data.
- Since the predictors are standardized and the response is log-transformed, the curb-weight coefficient means that a one-standard-deviation increase in curb weight multiplies expected highway mpg by e^(-0.1909) ≈ 0.83, roughly a 17% decrease, holding the other predictors fixed.
City MPG -
- Before the logarithmic transformation, the R^2 was 0.8762 for the training data and 0.7696 for the testing data.
- After the logarithmic transformation, the R^2 increased to 0.9013 for the training data and 0.8749 for the testing data.
- Similarly, a one-standard-deviation increase in curb weight multiplies expected city mpg by e^(-0.1451) ≈ 0.86, roughly a 14% decrease, holding the other predictors fixed.
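A general note on reading coefficients from a log-response model with standardized predictors: a one-unit increase in a standardized predictor $z_j$ (one standard deviation of the original variable) scales the predicted response by $e^{\beta_j}$, since

$$\log \hat{y} = \beta_0 + \sum_{j} \beta_j z_j \quad\Longrightarrow\quad \frac{\hat{y}\big|_{z_j + 1}}{\hat{y}\big|_{z_j}} = e^{\beta_j}$$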
Model Assumptions
===
**Subset EDA and Model Assumptions**
Column {.tabset data-width=400}
---
### Peak RPM
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = peakrpm)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "darkred")) + labs(x = "Peak RPM (rotations/second)", y = "MPG", title = "Relationship Between MPG and Corresponding Peak RPM") + theme_classic()
```
### Horsepower
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = horsepower)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "darkred")) + labs(x = "Horsepower (550ft-lbs/second)", y = "MPG", title = "Relationship Between MPG and Corresponding Horsepower") + theme_classic()
```
### Compression Ratio
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = compressionratio)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "darkred")) + labs(x = "Compression Ratio", y = "MPG", title = "Relationship Between MPG and Corresponding Compression Ratio") + theme_classic()
```
### Curb Weight
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = curbweight)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + labs(x = "Curb Weight (lbs)", y = "MPG", title = "Relationship Between MPG and Corresponding Curb Weight") + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "darkred")) + theme_classic()
```
### Length
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = length)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + scale_color_manual(values = c("City MPG" = "blue")) + labs(x = "Length (in)", y = "City MPG", title = "Relationship Between City MPG and Corresponding Length") + theme_classic() + theme(legend.position = "none")
```
### Stroke
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = stroke)) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("Highway MPG" = "darkred")) + labs(x = "Stroke (in)", y = "Highway MPG", title = "Relationship Between Highway MPG and Corresponding Stroke") + theme_classic() + theme(legend.position = "none")
```
### Bore
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = bore)) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("Highway MPG" = "darkred")) + labs(x = "Bore (in)", y = "Highway MPG", title = "Relationship Between Highway MPG and Corresponding Bore") + theme_classic() + theme(legend.position = "none")
```
### Engine Size
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = enginesize)) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("Highway MPG" = "darkred")) + labs(x = "Engine Size (cubic in)", y = "Highway MPG", title = "Relationship Between Highway MPG and Corresponding Engine Size") + theme_classic() + theme(legend.position = "none")
```
Column {.tabset data-width=600}
---
### Linearity
```{r, fig.align='center', echo=FALSE, out.width="50%"}
plot(residuals1~y_predictedtrain1, xlab = "Fitted Values", ylab = "Residuals", main = "Highway MPG", col = "darkred")
abline(h=0)
plot(residuals~y_predictedtrain, xlab = "Fitted Values", ylab = "Residuals", main = "City MPG", col = "blue")
abline(h=0)
```
### Normality
```{r, fig.align='center', echo=FALSE, out.width="50%"}
library(nortest)
qqnorm(residuals1, main = "Normal Q-Q Plot of Highway MPG")
qqline(residuals1, col = "darkred")
qqnorm(residuals, main = "Normal Q-Q Plot of City MPG")
qqline(residuals, col = "blue")
```
### A-D Test and Conclusions
```{r, fig.align='center', echo=FALSE, out.width="50%"}
ad.test(residuals1) #highway
ad.test(residuals) #city
```
From the residual vs. fitted value plots, we see that the error terms stay relatively consistent around the x-axis, rather than fanning out or showing any pattern that would imply non-constant variance. Thus, the assumption of constant variance is not violated for either the city or the highway mpg model.
Regarding the normality plots in the second tab, we see different results for the highway and city mpg models. The highway mpg plot shows a relatively linear pattern, which provides evidence that the assumption of normality is not violated. However, in the city mpg plot, the points near theoretical quantiles -1 and 1 fan outwards, away from the normality line. Thus, there is evidence that the normality assumption is violated for the city mpg model. These results are supported by the Anderson-Darling tests in this tab: the highway mpg model gives a p-value > 0.05, while the city mpg model does not.
I attempted standardizing the dataset and various transformations of the response variable for the city mpg model, but none fixed the violation of the normality assumption.
Other Models
===
Column {.tabset data-width=600}
---
### Highway Base Model
```{r, fig.align='center', echo=FALSE}
base_highway <- lm(highway~.-city, data = vehicles)
summary(base_highway)
```
### City Base Model
```{r, fig.align='center', echo=FALSE}
base_city <- lm(city ~ . - highway, data = vehicles)
summary(base_city)
```
### Assumptions
```{r, fig.align='center', echo=FALSE, out.width="50%"}
plot(base_highway$fitted.values, base_highway$residuals, col = "darkred", xlab = "Fitted Values", ylab = "Residuals", main = "Highway MPG")
abline(h=0)
plot(base_city$fitted.values, base_city$residuals, col = "blue", xlab = "Fitted Values", ylab = "Residuals", main = "City MPG")
abline(h=0)
```
Column {data-width=400}
---
**Base Linear Model Explanation and Conclusion**
As seen in the model summaries to the left, fitting an ordinary multiple linear regression to the data actually produces two strong R^2 values: 0.8556 for highway mpg and 0.9633 for city mpg. However, the presence of collinearity in these models, as shown under *Correlation Exploration*, deters ordinary multiple linear regression from being a viable way to model the data.
Additionally, when comparing the residual plots, we see that these models' error terms do not stay as consistent around the x-axis as the error terms from our log-LASSO models do.
**Project Conclusion**
We succeeded in creating a model that explains 84.60% of the total variability in highway mpg and does not violate our constant-variance or normality assumptions.
However, even though we found a model that explains 87.49% of the total variability in city mpg and does not violate our constant-variance assumption, the log-LASSO model for city mpg did violate our normality assumption. This ultimately calls into question the validity of that model and its predictive abilities.